Most time in data analysis is spent ‘tidying up’ data: getting it into a suitable format to get started. Data scientists have a particular definition of tidy: Tidy datasets are “easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row” (Wickham 2014).

It’s often not convenient for humans to enter data in a tidy way, so untidy data is probably more common than tidy data in the wild. But doing good, reproducible science demands that we document each step of our processing in a way that others can check or repeat in future. Tools like R make this easier.

Overview

In previous worksheets (e.g. here and here) we used various commands in the tidyverse, like filter and group_by.

If you want to recap these commands you can start to use the cheatsheet, especially the part on groups and summaries.

Today we will cover three additional techniques which are important when working with real datasets:

1, Creating new variables/columns 2. ‘Pivoting’ or reshaping data from long to wide formats (or the reverse) 3. Joining two sources of data (e.g. two spreadsheets) into a single dataframe

Before you start

  • Make sure you complete the worksheet on exploring data and plotting
  • Create a new R Script file to save your workings
  • Remember to load the tidyverse
library(tidyverse)

Making new variables

Sometimes we need to create new columns in our dataset.

For example, let’s say we wanted to calculate someone’s body mass index (BMI) from their weight and height.

There is a built-in dataset called women, which contains heights and weights of 15 women in lbs and inches:

women  %>% glimpse
> Rows: 15
> Columns: 2
> $ height <dbl> 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72
> $ weight <dbl> 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 15…

Explanation of the output We used glimpse in the block above. Glimpse is a bit like head, except it shows you a list of variables, with as many examples of the data as will fit on screen. The head command is the other way around — it show columns along the top, and the first few rows of data vertically. Either can be useful to quickly check what is in a dataset, but I often prefer glimpse if there are a lot of variables in a file because all of the column names are shown.


To calculate a BMI we first need to convert the heights to from inches to meters, and the weights from lbs to kilograms.

metric_women <- women  %>%
  mutate(
    height_m = height*0.0254,  # approx conversion from inches to m
    weight_kg= weight*0.45     # conversion from lbs to kg
  )

Explanation: We used the mutate function to covert lbs to kg and inches to m, and saved these two new columns in a new dataset called metric_women.

We can see the new columns here:

metric_women %>% head(3)
>   height weight height_m weight_kg
> 1     58    115   1.4732     51.75
> 2     59    117   1.4986     52.65
> 3     60    120   1.5240     54.00

BMI is calculated as \(\dfrac{kg}{m^2}\). We can use mutate again to create a new column to store this calculation:

metric_women  %>%
  mutate(BMI = weight_kg / height_m^2) %>%
  head(3)
>   height weight height_m weight_kg      BMI
> 1     58    115   1.4732     51.75 23.84443
> 2     59    117   1.4986     52.65 23.44374
> 3     60    120   1.5240     54.00 23.25005

Explanation: We used mutate again to make a new column, BMI. This contains women’s weight divided by their squared height (^2 means to the power of 2, or squared, in R-speak).

Create a density plot of BMI scores in the women dataset. It should look like this:

What is the median BMI in the sample?

Hints:

  • You will need to use mutate to create a new column containing BMI scores
  • When you create a dataset with this new column, either save the result with a new variable name and pipe that to ggplot, or pipe the result directly into ggplot. Either approach is fine.
  • We covered density plots in the first worksheet, and they are also in the cheatsheet.

Pivoting (reshaping)

Data is commonly stored in either wide or long format.

If you used SPSS to do a t-test or ANOVA during your undergraduate degree, you likely stored and analysed the data in wide format.

In wide format, each row represents the observations from a single participant. Each measurement for a given participant are stored in separate columns.

This is often called row per subject data. An example is the built in attitude dataset:

attitude %>%
  head()
>   rating complaints privileges learning raises critical advance
> 1     43         51         30       39     61       92      45
> 2     63         64         51       54     63       73      47
> 3     71         70         68       69     76       86      48
> 4     61         63         45       47     54       84      35
> 5     81         78         56       66     71       83      47
> 6     43         55         49       44     54       49      34

Explanation: Each row contains scores for a particular employee on various measures. To find out more about these data you can type ?attitude into the console.


Let’s say we want a single plot of all these variables, something like this:

To do this we first need to convert the data to long format. In long format, each observation is saved in its own row, rather than across multiple columns.

It’s often called “row per observation” data.


Pivoting is where you take a long data file (lots of rows, few columns) and make it wider. Or where you take a wide data file (lots of columns, few rows) and make it longer.

We can convert from wide to long using the pivot_longer command.

Another term sometimes used for pivoting data to long form is melting it. img: TrueWarrior


To see why the command is called ‘pivot_longer’, imagine trying to reshape just the first two rwos of the attitude dataset:

>   rating complaints privileges learning raises critical advance
> 1     43         51         30       39     61       92      45
> 2     63         64         51       54     63       73      47

If we use pivot_longer on this selection, we end up with this:

attitude %>%
  pivot_longer(everything())
> # A tibble: 210 x 2
>    name       value
>    <chr>      <dbl>
>  1 rating        43
>  2 complaints    51
>  3 privileges    30
>  4 learning      39
>  5 raises        61
>  6 critical      92
>  7 advance       45
>  8 rating        63
>  9 complaints    64
> 10 privileges    51
> # … with 200 more rows

Explanation of the command: We selected a subset of columns and rows. Then we used pivot_longer(everything()) to make this into long form data. The everything() tells R to merge values from all of the columns into a single new column called value, and to keep track of the original variable name in a new column called name)

The change works like like this:

Converting from wide format to long format


One problem with this example we don’t have a record of which participant was which in the attitude dataset, because the mapping to participants was implicit: each row was a new participant, and participant number was not recorded in the file.

We can create an explicit participant identifier by adding a new column. For this we use the mutate and row_number() commands:

attitude_with_person <- attitude %>%
  mutate(person = row_number()) %>%
  head(2)

attitude_with_person
>   rating complaints privileges learning raises critical advance person
> 1     43         51         30       39     61       92      45      1
> 2     63         64         51       54     63       73      47      2

Now we have a column called person which stores the row nuber.

But this means if we pivot_longer again, we will need to tell R which columns we would like to pivot.

If we don’t do this then the person column gets melted with everything else so we lose track of which response belonged to which participant:

attitude_with_person %>%
  pivot_longer(everything())
> # A tibble: 16 x 2
>    name       value
>    <chr>      <dbl>
>  1 rating        43
>  2 complaints    51
>  3 privileges    30
>  4 learning      39
>  5 raises        61
>  6 critical      92
>  7 advance       45
>  8 person         1
>  9 rating        63
> 10 complaints    64
> 11 privileges    51
> 12 learning      54
> 13 raises        63
> 14 critical      73
> 15 advance       47
> 16 person         2

Explanation of the output Because we didn’t tell pivot_longer which columns we wanted to pivot, it put all the values into a single new column called value. This included our participant identifier, person which is not what we wanted.


We can exclude person from the pivoting by writing:

attitude_with_person %>%
  pivot_longer(-person) %>% 
  head()
> # A tibble: 6 x 3
>   person name       value
>    <int> <chr>      <dbl>
> 1      1 rating        43
> 2      1 complaints    51
> 3      1 privileges    30
> 4      1 learning      39
> 5      1 raises        61
> 6      1 critical      92

Explanation of the command and output: Here, we still use pivot_longer but this time we put -person between the parentheses. The minus sign, -, means don’t include this variable, so -person ends up meaning include all columns except person, which is what we wanted. The output now retains the person column, but pivots the other variables. This means we can tell which person provided each datapoint.

Use some tidyverse commands you already know, plus pivot_longer, to produce this plot using the attitude dataset:

  • Check the cheatsheet if you are not sure how to add a boxplot layer (hint, it starts with geom_, so you might be able to guess or autocomplete it without looking it up)
  • You need to select only the three variables shown first
  • It’s not necessary to creat a person identifier for this plot (although it won’t hurt if you do)

Pivoting data to make summaries

Imagine we want a table of the mean score for each question in the attitude dataset.

This would be fiddly if we just tried to use summarise on wide format data. But if we use pivot_longer, group_by and then summarise (in that order) it’s possible to take the data and make a table like this with 3 instructions to R:

> # A tibble: 7 x 3
>   Name        Mean    SD
>   <chr>      <dbl> <dbl>
> 1 advance     42.9 10.3 
> 2 complaints  66.6 13.3 
> 3 critical    74.8  9.89
> 4 learning    56.4 11.7 
> 5 privileges  53.1 12.2 
> 6 raises      64.6 10.4 
> 7 rating      64.6 12.2

Combine the pivot_longer, group_by and summarise commands (in that order) to reproduce the table above.

Hints:

  • You want to pivot all of the variables in the attitude dataset this time
  • We covered using summarise in the first and second worksheets. If you want a reminder, the cheatsheet might be the quickest place to look

Pivoting wider

Sometimes we have the opposite problem: We have long data, but want it in wide format. For example, we might want a table where it’s easy to compare between different years, like this:

gapminder::gapminder %>%
  filter(year > 1990) %>% 
  pivot_wider(id_cols=country, 
              names_from=year, 
              values_from=gdpPercap) %>% 
  head(3) %>% 
  pander::pander("GDP per-capita in 3 countries in 3 different years, from the gapminder dataset.")
GDP per-capita in 3 countries in 3 different years, from the gapminder dataset.
country 1992 1997 2002 2007
Afghanistan 649.3 635.3 726.7 974.6
Albania 2497 3193 4604 5937
Algeria 5023 4797 5288 6223

Instead of making the data longer, now we want to pivot_wider.

As we saw before, the gapminder data is a fairly long format. There are multiple rows per-country corresponding to different years.

Let’s say we want to compare GDP in different years. We first need to select the data we want — country, year and GDP:

gapminder1990s <- gapminder::gapminder %>%
  select(country, year, gdpPercap) %>%
  filter(year >= 1990)

Then we pivot_wider:

gapminder1990s %>%
  pivot_wider(names_from = year, values_from = gdpPercap) %>% 
  head()
> # A tibble: 6 x 5
>   country     `1992` `1997` `2002` `2007`
>   <fct>        <dbl>  <dbl>  <dbl>  <dbl>
> 1 Afghanistan   649.   635.   727.   975.
> 2 Albania      2497.  3193.  4604.  5937.
> 3 Algeria      5023.  4797.  5288.  6223.
> 4 Angola       2628.  2277.  2773.  4797.
> 5 Argentina    9308. 10967.  8798. 12779.
> 6 Australia   23425. 26998. 30688. 34435.

Explanation of the command and output: We started with multiple rows per country, corresponding to years. We used pivot_wider with names_from = year to create new columns for each year in the data. We used values_from=gdpPercap to tell pivot_longer to use the GDP numbers to populate the table. The resulting table helps us compare years within countries, or between countries for a given year.

Experiment for yourself with pivot_longer and pivot_wider:

  • Try using pivot_longer with some different datasets, perhaps iris and mtcars
  • Use the pivoted data with group_by to create summary tables

If you have time, now try the extension exercises

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.







All content on this site distributed under a Creative Commons licence. CC-BY-SA 4.0.